Tokenize whitespace #1570

Open · wants to merge 4 commits into base: master
Conversation

@JohnnyMorganz (Contributor)

We would like to develop tooling that can consume the syntax tree and perform transformations whilst preserving the underlying structure of the source code. This requires the lexer / parser to retain enough information to be able to re-print the original code.

The first step towards this is to ensure the lexer preserves the input structure. In this commit, we enable tokenization of whitespace, stored in a new Lexeme of type Whitespace. Multiline whitespace is all stored in a single Lexeme.
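
For illustration, here's a minimal standalone sketch of the scanning idea, not the actual Luau lexer internals (`scanWhitespaceRun` is a hypothetical helper):

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Sketch only: scan the maximal run of whitespace (spaces, tabs, newlines)
// starting at `pos` and return the offset just past it, so the whole run
// can become a single Whitespace lexeme spanning [pos, end).
size_t scanWhitespaceRun(const std::string& source, size_t pos)
{
    while (pos < source.size() && std::isspace(static_cast<unsigned char>(source[pos])))
        ++pos;
    return pos;
}
```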


```diff
     if (updatePrevLocation)
         prevLocation = lexeme.location;

     lexeme = readNext();
     updatePrevLocation = false;
-} while (skipComments && (lexeme.type == Lexeme::Comment || lexeme.type == Lexeme::BlockComment));
+} while (skipTrivia && (lexeme.type == Lexeme::Comment || lexeme.type == Lexeme::BlockComment || lexeme.type == Lexeme::Whitespace));
```
@JohnnyMorganz (Contributor, Author)

Decided to merge skipComments into a single skipTrivia toggle.

However, I'm also open to splitting them into separate toggles, e.g. skipWhitespace, if that makes more sense.

@JohnnyMorganz (Contributor, Author)

Actually, given the test failures and the invasiveness of the change, I decided to split skipWhitespace out into a separate toggle.

But if the inverse makes more sense, we can change it :)
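
For clarity, a sketch of the split-toggle check (with a minimal stand-in for `Lexeme::Type`; the real flags live on the lexer):

```cpp
// Minimal stand-in; the real lexer uses Lexeme::Type.
enum class TokenType { Comment, BlockComment, Whitespace, Other };

// With separate toggles, comments and whitespace are skipped independently.
bool shouldSkip(TokenType type, bool skipComments, bool skipWhitespace)
{
    if (skipComments && (type == TokenType::Comment || type == TokenType::BlockComment))
        return true;
    return skipWhitespace && type == TokenType::Whitespace;
}
```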

@JohnnyMorganz JohnnyMorganz marked this pull request as draft December 16, 2024 20:25
@vegorov-rbx (Collaborator)

Changes to lexer/parser on the hot path require benchmarking.

If you don't have a large codebase to do it on, we can measure it for you, but it will take time.

@JohnnyMorganz (Contributor, Author)

Some benchmarking after running on a ~1400 KLOC repository: this is 10 runs of `./luau-compile --null .`, reading the parse time it prints (note: the printed time is only given to 2 d.p.).

| Command | Mean ± std.dev [s] | Min [s] | Max [s] |
| --- | --- | --- | --- |
| `./luau-compile.old --null .` | 0.416 ± 0.017 | 0.39 | 0.45 |
| `./luau-compile.new --null --fflags=LuauLexerTokenizesWhitespace=False .` | 0.423 ± 0.021 | 0.39 | 0.46 |
| `./luau-compile.new --null --fflags=LuauLexerTokenizesWhitespace=True .` | 0.447 ± 0.012 | 0.43 | 0.46 |

I also ran it under hyperfine, in case that's a more helpful comparison:

```
$ hyperfine -N --warmup 1 "./luau-compile.old --null ." "./luau-compile.new --null --fflags=LuauLexerTokenizesWhitespace=False ." "./luau-compile.new --null --fflags=LuauLexerTokenizesWhitespace=True ."
Benchmark 1: ./luau-compile.old --null .
  Time (mean ± σ):      2.043 s ±  0.048 s    [User: 1.083 s, System: 0.648 s]
  Range (min … max):    1.947 s …  2.121 s    10 runs

Benchmark 2: ./luau-compile.new --null --fflags=LuauLexerTokenizesWhitespace=False .
  Time (mean ± σ):      2.062 s ±  0.066 s    [User: 1.091 s, System: 0.648 s]
  Range (min … max):    1.979 s …  2.172 s    10 runs

Benchmark 3: ./luau-compile.new --null --fflags=LuauLexerTokenizesWhitespace=True .
  Time (mean ± σ):      2.077 s ±  0.064 s    [User: 1.114 s, System: 0.648 s]
  Range (min … max):    2.026 s …  2.249 s    10 runs

Summary
  ./luau-compile.old --null . ran
    1.01 ± 0.04 times faster than ./luau-compile.new --null --fflags=LuauLexerTokenizesWhitespace=False .
    1.02 ± 0.04 times faster than ./luau-compile.new --null --fflags=LuauLexerTokenizesWhitespace=True .
```

The numbers do fluctuate across repeated runs, though. Let me know if I should benchmark differently.

@JohnnyMorganz JohnnyMorganz marked this pull request as ready for review December 17, 2024 21:22
@vegorov-rbx (Collaborator) left a comment

I don't think it makes sense to preserve specific whitespace given the overhead.

We are already able to reproduce files with their original formatting, mostly from location information alone (that's how the Transpiler works).

I guess we lose some info that way, like mixed tabs/spaces and maybe trailing garbage at the end of lines.

Since whitespace lexemes are not even captured in the tree yet, I wonder if you could fill in the blanks in a separate pass based on the location information of the nodes.

Or maybe we don't even need to build that structure: when we output the whitespace, we can go and look at what kind of symbols were used in the original file (even if AST nodes were moved around, it should be reconstructable).
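
As a sketch of that "separate pass" idea (types and names here are hypothetical stand-ins, not Luau's actual API), the exact text between two adjacent nodes can be sliced straight out of the original source using their locations:

```cpp
#include <string>
#include <vector>

// Stand-in for a line/column position (0-indexed), ordered within a file.
struct Position { unsigned line; unsigned column; };

// Recover the exact characters (spaces, tabs, newlines) between the end of
// one node and the start of the next. Assumes `lines` is the original
// source split on '\n' and that `end` precedes `start`.
std::string textBetween(const std::vector<std::string>& lines, Position end, Position start)
{
    std::string out;
    for (unsigned line = end.line; line <= start.line; ++line)
    {
        size_t from = (line == end.line) ? end.column : 0;
        size_t to = (line == start.line) ? start.column : lines[line].size();
        out += lines[line].substr(from, to - from);
        if (line != start.line)
            out += '\n';
    }
    return out;
}
```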

It might also be acceptable in the current form if we optimize whatever is causing the perf overhead (e.g. instead of checking 5 token types, we could pack those enum values together and do `unsigned(token - unimportant_start) <= (unimportant_last - unimportant_start)`, or something else).
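
A sketch of that packed-enum check (the layout is illustrative; the real `Lexeme::Type` values would have to be rearranged so the skippable kinds are contiguous):

```cpp
// Illustrative layout: the "unimportant" kinds occupy one contiguous range.
enum Type { Eof, Name, Number, String, Comment, BlockComment, Whitespace };

constexpr Type kUnimportantStart = Comment;
constexpr Type kUnimportantLast = Whitespace;

// One unsigned comparison replaces the chain of equality checks.
bool isUnimportant(Type token)
{
    return unsigned(token - kUnimportantStart) <= unsigned(kUnimportantLast - kUnimportantStart);
}
```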

@JohnnyMorganz (Contributor, Author)

Thanks, this makes sense; I'll take a look. For context, my original thinking was to make the lexer retain all source information, so that given a stream of just lexer tokens we could recreate the original input. Then, given an AST node and a set of related lexer tokens (which would have to be stored during parsing), we'd be able to recreate exactly how that node was written.

Indeed, this would be important so that we could preserve the exact style of whitespace (spaces vs. tabs, leading/trailing whitespace, etc.).
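
To make that concrete, here's a sketch of the kind of full-fidelity token structure I had in mind (names are hypothetical): each token carries the trivia written before it, so concatenating the stream reproduces the input byte-for-byte:

```cpp
#include <string>
#include <vector>

// Hypothetical structure, not part of this PR.
struct TriviaToken
{
    std::string leadingTrivia; // whitespace/comments appearing before the token
    std::string text;          // the token itself, exactly as written
};

// Round-trip: leading trivia + token text, in order, rebuilds the source.
// (A final EOF token's leadingTrivia would capture trailing whitespace.)
std::string reprint(const std::vector<TriviaToken>& tokens)
{
    std::string out;
    for (const TriviaToken& t : tokens)
        out += t.leadingTrivia + t.text;
    return out;
}
```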

But the general idea of "filling in the gaps" with just location info is interesting. I haven't explored it much; I'll try playing around with it further.

> Or maybe we don't even need to build that structure: when we output the whitespace, we can go and look at what kind of symbols were used in the original file (even if AST nodes were moved around, it should be reconstructable).

This would probably work if the end goal were just to reconstruct the input, but without some sort of structure I imagine the interface for performing code transforms would be much harder to work with, especially if you want to modify the whitespace itself. I can see it being doable, though.
